As part of my Visual Analytics course at SMU, prof teaches data visualization in R using R Markdown. This post serves as both the in-class exercise for the lesson and my own notes.
Note to prof, if he is reading: I have reorganized the in-class exercise in a way that makes sense to me, so it may not follow the step-by-step process prof went through in class. I have also added items that I had kept as haphazard notes on my desktop. Hope this is alright!
Below is a shortcut for loading all the packages in one go. We can add whichever packages we want to the packages vector. This is an alternative to loading each package line by line with library().
packages <- c('DT', 'ggiraph', 'plotly', 'tidyverse')
for (p in packages) {
  if (!require(p, character.only = TRUE)) {
    install.packages(p)
  }
  library(p, character.only = TRUE)
}
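The loop above installs and loads one package at a time. As a minimal sketch of the same idea in base R, the two steps can be split into small helpers. The function names `missing_packages` and `load_packages` are my own and not from the lesson, and the demo call uses packages that ship with R so that nothing is installed when trying it out.

```r
# Return the subset of `pkgs` that is not installed yet.
# (hypothetical helper name, not part of any package)
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# Install whatever is missing in one call, then attach every package.
# (hypothetical helper name, not part of any package)
load_packages <- function(pkgs) {
  to_install <- missing_packages(pkgs)
  if (length(to_install) > 0) {
    install.packages(to_install)
  }
  invisible(lapply(pkgs, library, character.only = TRUE))
}

# Demo with packages that come with R, so no installation is triggered.
load_packages(c("stats", "utils"))
```

For this post the call would be `load_packages(c("DT", "ggiraph", "plotly", "tidyverse"))`.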
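We use the read_csv function from readr, one of the tidyverse packages, to read the csv file. For an Excel file, we can use read_excel from the readxl package instead (it is installed with the tidyverse but has to be loaded separately with library(readxl)).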
exam_data <- read_csv("data/Exam_data.csv")
summary(exam_data)
ID CLASS GENDER
Length:322 Length:322 Length:322
Class :character Class :character Class :character
Mode :character Mode :character Mode :character
RACE ENGLISH MATHS SCIENCE
Length:322 Min. :21.00 Min. : 9.00 Min. :15.00
Class :character 1st Qu.:59.00 1st Qu.:58.00 1st Qu.:49.25
Mode :character Median :70.00 Median :74.00 Median :65.00
Mean :67.18 Mean :69.33 Mean :61.16
3rd Qu.:78.00 3rd Qu.:85.00 3rd Qu.:74.75
Max. :96.00 Max. :99.00 Max. :96.00
The data are the year-end examination grades of a cohort of Primary 3 students. Each student is identified by ID and has the attributes CLASS, GENDER and RACE, along with scores in ENGLISH, MATHS and SCIENCE.
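In the summary output, the character columns only report their length, which says little about CLASS, GENDER or RACE. One possible way to get counts per category is to convert those columns to factors first. The sketch below uses a small made-up data frame as a stand-in for exam_data (the values are invented), so it runs without the csv file.

```r
# Toy stand-in for exam_data; the values are invented for illustration.
toy <- data.frame(
  ID     = c("S01", "S02", "S03", "S04", "S05", "S06"),
  CLASS  = c("3A", "3A", "3B", "3B", "3B", "3C"),
  GENDER = c("Female", "Male", "Male", "Female", "Female", "Male"),
  MATHS  = c(72, 55, 91, 64, 80, 47),
  stringsAsFactors = FALSE
)

# Convert the categorical columns to factors so summary() counts each level.
categorical <- c("CLASS", "GENDER")
toy[categorical] <- lapply(toy[categorical], factor)

summary(toy)

# table() gives the same counts for a single column.
class_counts <- table(toy$CLASS)
class_counts
```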
ggplot2 provides a very systematic way of creating visualizations, building each plot up layer by layer.
To go through each layer, we use a simple example: building a histogram that shows the distribution of MATHS scores.
# we can use this format
ggplot(data = exam_data)
# or this format
exam_data %>%
ggplot()
Both give a blank canvas. The second format makes it easy to manipulate the data with dplyr on the fly for a particular graph, e.g. applying filter or creating a temporary column just for that graph, while keeping the original table intact. Prof uses the first format, but I prefer the second and use it throughout the rest of the code.
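To illustrate the on-the-fly manipulation, here is a rough sketch with a made-up data frame (the values and the temporary TOTAL column are my own, not from the class exercise). The filter and mutate steps only affect what is passed into ggplot; the original table is left untouched.

```r
library(dplyr)
library(ggplot2)

# Toy stand-in for exam_data; the values are invented for illustration.
toy <- data.frame(
  CLASS   = c("3A", "3A", "3B", "3B", "3C", "3C"),
  ENGLISH = c(70, 62, 85, 58, 77, 66),
  MATHS   = c(72, 55, 91, 64, 80, 47)
)

p <- toy %>%
  filter(CLASS != "3C") %>%             # keep only two classes for this graph
  mutate(TOTAL = ENGLISH + MATHS) %>%   # temporary column just for this graph
  ggplot(aes(x = TOTAL)) +
  geom_histogram(bins = 5)

p

# The original table still has its three columns and six rows.
names(toy)
```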
Aesthetics define how we map attributes of the data to visual characteristics, e.g. axis position, colour, size, shape and transparency.
As we want to create a histogram of MATHS scores, we map MATHS to x.
exam_data %>%
ggplot(aes(x=MATHS))
We can now see MATHS on the x-axis.
Next we add a geom layer to say what sort of plot we want.
# for a normal bar plot
exam_data %>%
ggplot(aes(x=MATHS)) +
geom_bar()
# e.g. we want to split by fill
exam_data %>%
ggplot(aes(x=MATHS, fill=GENDER)) +
geom_histogram()
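The two geoms above treat a numeric score differently: geom_bar() counts the rows for each distinct x value, while geom_histogram() first groups the values into bins. A small sketch with invented scores makes the difference visible by pulling out the computed data with layer_data() from ggplot2. The binwidth of 10 is an arbitrary choice for the example.

```r
library(ggplot2)

# Invented scores, with some repeated values.
toy <- data.frame(MATHS = c(47, 55, 55, 64, 72, 72, 72, 80, 91, 91))

bar_plot  <- ggplot(toy, aes(x = MATHS)) + geom_bar()
hist_plot <- ggplot(toy, aes(x = MATHS)) + geom_histogram(binwidth = 10)

# layer_data() returns the data frame each layer actually draws.
bar_data  <- layer_data(bar_plot)
hist_data <- layer_data(hist_plot)

# One row per distinct score for the bar plot.
bar_data[, c("x", "count")]

# One row per bin for the histogram.
hist_data[, c("xmin", "xmax", "count")]
```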
Here are some common geoms and simple customizations that I know of and would likely use often (something like a cheat sheet for myself):
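# histogram with a set number of bins, outline colour and fill colour
exam_data %>%
  ggplot(aes(x = MATHS)) +
  geom_histogram(bins = 20,
                 color = "black",
                 fill = "light blue")
# histogram split by GENDER using fill
exam_data %>%
  ggplot(aes(x = MATHS, fill = GENDER)) +
  geom_histogram(bins = 20,
                 color = "grey30")
# dot plot coloured by RACE
exam_data %>%
  ggplot(aes(x = MATHS, fill = RACE)) +
  geom_dotplot(binwidth = 2.5, dotsize = 0.5)
# box plot with the individual points jittered on top
exam_data %>%
  ggplot(aes(x = GENDER, y = MATHS)) +
  geom_boxplot() +
  geom_point(position = 'jitter', size = 0.5)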
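One thing to watch in the box plot with jittered points: the box plot draws its own outlier points, so outliers can appear twice, and the default jitter also moves points vertically. A possible variation, sketched here with invented scores, hides the box plot outliers and jitters horizontally only. The width of 0.2 and the seed are arbitrary choices of mine.

```r
library(ggplot2)

# Toy stand-in for exam_data; the values are invented for illustration.
toy <- data.frame(
  GENDER = rep(c("Female", "Male"), each = 6),
  MATHS  = c(72, 64, 80, 85, 20, 77, 55, 91, 47, 68, 74, 60)
)

p <- ggplot(toy, aes(x = GENDER, y = MATHS)) +
  # outlier.shape = NA hides the outliers drawn by the box plot itself
  geom_boxplot(outlier.shape = NA) +
  # jitter sideways only (height = 0), with a fixed seed for a repeatable plot
  geom_point(position = position_jitter(width = 0.2, height = 0, seed = 1234),
             size = 0.5)

p
```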
Difference between ggplot2 and ggiraph:
p <- exam_data %>%
  ggplot(aes(x = MATHS)) +
  geom_dotplot_interactive(aes(tooltip = CLASS, data_id = CLASS),
                           stackgroups = TRUE,
                           binwidth = 1,
                           method = 'histodot') +
  scale_y_continuous(NULL, breaks = NULL)
girafe(
  ggobj = p,
  width_svg = 6,
  height_svg = 6 * 0.618
)
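In the code above, the ggiraph version differs from plain ggplot2 in two places: the geom gets an _interactive suffix and takes the extra tooltip and data_id aesthetics, and the plot is rendered with girafe() instead of being printed directly. As a sketch of one further customization, the hover styling can be set through the options argument of girafe(). The data values and the CSS below are my own choices for illustration, not from the lesson.

```r
library(ggplot2)
library(ggiraph)

# Toy stand-in for exam_data; the values are invented for illustration.
toy <- data.frame(
  ID    = sprintf("S%02d", 1:8),
  CLASS = rep(c("3A", "3B"), each = 4),
  MATHS = c(47, 55, 64, 72, 72, 80, 85, 91)
)

p <- ggplot(toy, aes(x = MATHS)) +
  geom_dotplot_interactive(
    aes(tooltip = ID, data_id = CLASS),  # tooltip per student, highlight per class
    stackgroups = TRUE,
    binwidth = 5,
    method = "histodot"
  ) +
  scale_y_continuous(NULL, breaks = NULL)

widget <- girafe(
  ggobj = p,
  width_svg = 6,
  height_svg = 6 * 0.618,
  options = list(
    opts_hover(css = "fill: orange;"),      # style of the hovered class
    opts_hover_inv(css = "opacity: 0.2;")   # fade everything else
  )
)

widget
```

Because data_id is mapped to CLASS, hovering over one dot highlights every student in the same class.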
Plotly has more interactivity options than ggiraph with less ‘coding’. The panel in the top right-hand corner of the plot offers interactive tools, such as zoom and selection, that can help us with our analysis.
exam_data %>%
plot_ly(x = ~MATHS,
y = ~ENGLISH)
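When no trace type is given, as in the call above, plot_ly guesses one from the data and prints a message about it. A small sketch of the more explicit form, using invented scores in place of exam_data:

```r
library(plotly)

# Toy stand-in for exam_data; the values are invented for illustration.
toy <- data.frame(
  MATHS   = c(72, 55, 91, 64, 80, 47),
  ENGLISH = c(70, 62, 85, 58, 77, 66)
)

# Stating type and mode avoids the "No trace type specified" message.
fig <- plot_ly(
  data = toy,
  x = ~MATHS,
  y = ~ENGLISH,
  type = "scatter",
  mode = "markers"
)

fig
```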
The legend interactivity in plotly is different from Tableau. When we click on a category in the legend, it ‘deselects’ (hides) that category rather than selecting it.
exam_data %>%
plot_ly(x = ~MATHS,
y = ~ENGLISH,
color = ~RACE)
We can also change the colour palette/scheme.
# use a ColorBrewer palette by name
exam_data %>%
  plot_ly(x = ~MATHS,
          y = ~ENGLISH,
          color = ~RACE,
          colors = "Set1")
# use a custom palette
pal <- c("red", "blue", "green", "purple")
exam_data %>%
  plot_ly(x = ~MATHS,
          y = ~ENGLISH,
          color = ~RACE,
          colors = pal)
# customize the tooltip text
exam_data %>%
  plot_ly(x = ~MATHS,
          y = ~ENGLISH,
          text = ~paste("Student ID", ID,
                        "<br>Class:", CLASS),
          color = ~RACE,
          colors = "Set1")
This is where we can add in/do the following:
We first save the ggplot as an object, then use ggplotly to ‘wrap’ it. Note: ggplot2 is sometimes not 100% compatible with ggplotly, so we need to check the output carefully.
sub
# highlight_key() wraps the data so that both plots share one selection
d <- highlight_key(exam_data)
# geom_point() has no dotsize argument; the point size is set with size
p1 <- ggplot(data = d, aes(x = MATHS, y = ENGLISH)) +
  geom_point(size = 1) +
  coord_cartesian(xlim = c(0, 100),
                  ylim = c(0, 100))
p2 <- ggplot(data = d, aes(x = MATHS, y = SCIENCE)) +
  geom_point(size = 1) +
  coord_cartesian(xlim = c(0, 100),
                  ylim = c(0, 100))
subplot(ggplotly(p1), ggplotly(p2))
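By default the linked plots respond to clicks on single points. As a sketch of one possible variation, plotly's highlight() function can switch the linking to box or lasso selection instead. The data below is invented, and the choice of the "plotly_selected" and "plotly_deselect" events is mine rather than something covered in class.

```r
library(ggplot2)
library(plotly)

# Toy stand-in for exam_data; the values are invented for illustration.
toy <- data.frame(
  ID      = sprintf("S%02d", 1:6),
  ENGLISH = c(70, 62, 85, 58, 77, 66),
  MATHS   = c(72, 55, 91, 64, 80, 47),
  SCIENCE = c(65, 50, 88, 61, 70, 45)
)

# Shared data object so a selection in one plot carries over to the other.
d <- highlight_key(toy)

p1 <- ggplot(d, aes(x = MATHS, y = ENGLISH)) + geom_point(size = 1)
p2 <- ggplot(d, aes(x = MATHS, y = SCIENCE)) + geom_point(size = 1)

# Link on box/lasso selection; clearing the selection resets the highlight.
linked <- subplot(ggplotly(p1), ggplotly(p2)) %>%
  highlight(on = "plotly_selected", off = "plotly_deselect")

linked
```

Dragging a box over some students in the left plot should then highlight the same students in the right plot.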